AITopics | data pipeline

Collaborating Authors

data pipeline

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ACT as Human: Multimodal Large Language Model Data Annotation with Critical Thinking

Lin, Lequan, Shi, Dai, Han, Andi, Chen, Feng, Chen, Qiuzheng, Li, Jiawen, Li, Zhaoyang, Li, Jiyuan, Sun, Zhenbang, Gao, Junbin

arXiv.org Artificial IntelligenceNov-14-2025

Supervised learning relies on high-quality labeled data, but obtaining such data through human annotation is both expensive and time-consuming. Recent work explores using large language models (LLMs) for annotation, but LLM-generated labels still fall short of human-level quality. To address this problem, we propose the Annotation with Critical Thinking (ACT) data pipeline, where LLMs serve not only as annotators but also as judges to critically identify potential errors. Human effort is then directed towards reviewing only the most "suspicious" cases, significantly improving the human annotation efficiency. Our major contributions are as follows: (1) ACT is applicable to a wide range of domains, including natural language processing (NLP), computer vision (CV), and multimodal understanding, by leveraging multimodal-LLMs (MLLMs). (2) Through empirical studies, we derive 7 insights on how to enhance annotation quality while efficiently reducing the human cost, and then translate these findings into user-friendly guidelines. (3) We theoretically analyze how to modify the loss function so that models trained on ACT data achieve similar performance to those trained on fully human-annotated data. Our experiments show that the performance gap can be reduced to less than 2% on most benchmark datasets while saving up to 90% of human costs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2511.09833

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Proportion and Perspective Control for Flow-Based Image Generation

Boudier, Julien, Caselles-Dupré, Hugo

arXiv.org Artificial IntelligenceOct-28-2025

While modern text-to-image diffusion models generate high-fidelity images, they offer limited control over the spatial and geometric structure of the output. To address this, we introduce and evaluate two ControlNets specialized for artistic control: (1) a proportion ControlNet that uses bounding boxes to dictate the position and scale of objects, and (2) a perspective ControlNet that employs vanishing lines to control the 3D geometry of the scene. We support the training of these modules with data pipelines that leverage vision-language models for annotation and specialized algorithms for conditioning image synthesis. Our experiments demonstrate that both modules provide effective control but exhibit limitations with complex constraints. Both models are released on HuggingFace: https://huggingface.co/obvious-research

artificial intelligence, controlnet, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2510.21763

Country: Europe > France > Île-de-France > Paris > Paris (0.05)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.41)

Add feedback

Safe, Untrusted, "Proof-Carrying" AI Agents: toward the agentic lakehouse

Tagliabue, Jacopo, Greco, Ciro

arXiv.org Artificial IntelligenceOct-13-2025

Starting from this prototype, we conclude by outlining practical next steps for a full agentic lakehouse. The paper is organized as follows. After reviewing agent-friendly abstractions (Section II), we address key safety objections for high-stakes scenarios (Section III). Once safety is established, we describe a ReAct [12] loop built on these abstractions (Section IV). We put forward our working prototype as a feasibility demonstration of safe-by-design data agents, not as a full-fledged experimental benchmark. We believe that sharing working code is of great value to the community, especially in times of quickly shifting mental models. However, it is important to remember that our fundamental insights - programmability and safety - can be replicated independently of the chosen APIs. For these reasons, we believe our paper to be valuable to a wide range of practitioners: on one hand, those looking for a new mental map of this uncharted territory; on the other, those looking to be inspired by tinkering with existing implementations and inspecting systems working at scale.

data mining, large language model, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.09567

Country:

North America > United States (0.05)
Europe > Germany (0.04)

Genre: Research Report (0.42)

Technology:

Information Technology > Data Science > Data Mining (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.65)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.49)

Add feedback

InstaGeo: Compute-Efficient Geospatial Machine Learning from Data to Deployment

Yusuf, Ibrahim Salihu, Houndayi, Iffanice, Oualha, Rym, Cherif, Mohamed Aziz, Panford-Quainoo, Kobby, Pretorius, Arnu

arXiv.org Artificial IntelligenceOct-8-2025

Open-access multispectral imagery from missions like Landsat 8-9 and Sentinel-2 has fueled the development of geospatial foundation models (GFMs) for humanitarian and environmental applications. Yet, their deployment remains limited by (i) the absence of automated geospatial data pipelines and (ii) the large size of fine-tuned models. Existing GFMs lack workflows for processing raw satellite imagery, and downstream adaptations often retain the full complexity of the original encoder. We present InstaGeo, an open-source, end-to-end framework that addresses these challenges by integrating: (1) automated data curation to transform raw imagery into model-ready datasets; (2) task-specific model distillation to derive compact, compute-efficient models; and (3) seamless deployment as interactive web-map applications. Using InstaGeo, we reproduced datasets from three published studies and trained models with marginal mIoU differences of -0.73 pp for flood mapping, -0.20 pp for crop segmentation, and +1.79 pp for desert locust prediction. The distilled models are up to 8x smaller than standard fine-tuned counterparts, reducing FLOPs and CO2 emissions with minimal accuracy loss. Leveraging InstaGeo's streamlined data pipeline, we also curated a larger crop segmentation dataset, achieving a state-of-the-art mIoU of 60.65%, a 12 pp improvement over prior baselines. Moreover, InstaGeo enables users to progress from raw data to model deployment within a single working day. By unifying data preparation, model compression, and deployment, InstaGeo transforms research-grade GFMs into practical, low-carbon tools for real-time, large-scale Earth observation. This approach shifts geospatial AI toward data quality and application-driven innovation. Source code, datasets, and model checkpoints are available at: https://github.com/instadeepai/InstaGeo-E2E-Geospatial-ML.git

data quality, instageo, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2510.05617

Genre: Research Report (0.52)

Industry: Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.34)

Technology:

Information Technology > Data Science > Data Quality (0.54)
Information Technology > Artificial Intelligence > Machine Learning (0.51)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.34)

Add feedback

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Zheng, Tong, Zhang, Hongming, Yu, Wenhao, Wang, Xiaoyang, Dai, Runpeng, Liu, Rui, Bao, Huiwen, Huang, Chengsong, Huang, Heng, Yu, Dong

arXiv.org Artificial IntelligenceSep-15-2025

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a \textbf{mid-training exploration scaffold}, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.

large language model, machine learning, reinforcement learning, (15 more...)

arXiv.org Artificial Intelligence

2509.0798

Country:

Asia > Middle East > Jordan (0.04)
North America > United States > North Carolina (0.04)
North America > United States > Maryland > Prince George's County > College Park (0.04)
(2 more...)

Genre: Research Report > Promising Solution (0.48)

Industry: Energy (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

Add feedback

Describe Anything: Detailed Localized Image and Video Captioning

Lian, Long, Ding, Yifan, Ge, Yunhao, Liu, Sifei, Mao, Hanzi, Li, Boyi, Pavone, Marco, Liu, Ming-Yu, Darrell, Trevor, Yala, Adam, Cui, Yin

arXiv.org Artificial IntelligenceApr-23-2025

Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 7 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.

caption, large language model, machine learning, (20 more...)

arXiv.org Artificial Intelligence

2504.16072

Genre: Research Report (0.40)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(3 more...)

Add feedback

Automated Planning for Optimal Data Pipeline Instantiation

Amado, Leonardo Rosa, Vogel, Adriano, Griebler, Dalvan, Licks, Gabriel Paludo, Simon, Eric, Meneguzzi, Felipe

arXiv.org Artificial IntelligenceMar-16-2025

Data pipeline frameworks provide abstractions for implementing sequences of data-intensive transformation operators, automating the deployment and execution of such transformations in a cluster. Deploying a data pipeline, however, requires computing resources to be allocated in a data center, ideally minimizing the overhead for communicating data and executing operators in the pipeline while considering each operator's execution requirements. In this paper, we model the problem of optimal data pipeline deployment as planning with action costs, where we propose heuristics aiming to minimize total execution time. Experimental results indicate that the heuristics can outperform the baseline deployment and that a heuristic based on connections outperforms other strategies.

artificial intelligence, operator, planning & scheduling, (15 more...)

arXiv.org Artificial Intelligence

2503.12626

Country:

South America > Brazil > Rio Grande do Sul (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
North America > United States > New York > New York County > New York City (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Industry: Information Technology > Services (0.34)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Planning & Scheduling (1.00)

Add feedback

V2X-LLM: Enhancing V2X Integration and Understanding in Connected Vehicle Corridors

Wu, Keshu, Li, Pei, Zhou, Yang, Gan, Rui, You, Junwei, Cheng, Yang, Zhu, Jingwen, Parker, Steven T., Ran, Bin, Noyce, David A., Tu, Zhengzhong

arXiv.org Artificial IntelligenceMar-3-2025

The advancement of Connected and Automated Vehicles (CAVs) and Vehicle-to-Everything (V2X) offers significant potential for enhancing transportation safety, mobility, and sustainability. However, the integration and analysis of the diverse and voluminous V2X data, including Basic Safety Messages (BSMs) and Signal Phase and Timing (SPaT) data, present substantial challenges, especially on Connected Vehicle Corridors. These challenges include managing large data volumes, ensuring real-time data integration, and understanding complex traffic scenarios. Although these projects have developed an advanced CAV data pipeline that enables real-time communication between vehicles, infrastructure, and other road users for managing connected vehicle and roadside unit (RSU) data, significant hurdles in data comprehension and real-time scenario analysis and reasoning persist. To address these issues, we introduce the V2X-LLM framework, a novel enhancement to the existing CV data pipeline. V2X-LLM leverages Large Language Models (LLMs) to improve the understanding and real-time analysis of V2X data. The framework includes four key tasks: Scenario Explanation, offering detailed narratives of traffic conditions; V2X Data Description, detailing vehicle and infrastructure statuses; State Prediction, forecasting future traffic states; and Navigation Advisory, providing optimized routing instructions. By integrating LLM-driven reasoning with V2X data within the data pipeline, the V2X-LLM framework offers real-time feedback and decision support for traffic management. This integration enhances the accuracy of traffic analysis, safety, and traffic optimization. Demonstrations in a real-world urban corridor highlight the framework's potential to advance intelligent transportation systems.

intersection, park st, vehicle, (16 more...)

arXiv.org Artificial Intelligence

2503.02239

Country:

Asia > China (0.46)
North America > United States > Wisconsin > Dane County > Madison (0.15)
North America > United States > Texas > Brazos County > College Station (0.14)
(2 more...)

Genre: Research Report (1.00)

Industry:

Transportation > Ground > Road (1.00)
Consumer Products & Services > Travel (0.68)
Transportation > Infrastructure & Services (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Architecture > Real Time Systems (1.00)

Add feedback

LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models

Kim, Yungi, Ha, Hyunsoo, Yang, Seonghoon, Lee, Sukyung, Kim, Jihoo, Park, Chanjun

arXiv.org Artificial IntelligenceNov-18-2024

Creating high-quality, large-scale datasets for large language models (LLMs) often relies on resource-intensive, GPU-accelerated models for quality filtering, making the process time-consuming and costly. This dependence on GPUs limits accessibility for organizations lacking significant computational infrastructure. To address this issue, we introduce the Lightweight, Purpose-driven (LP) Data Pipeline, a framework that operates entirely on CPUs to streamline the processes of dataset extraction, filtering, and curation. Based on our four core principles, the LP Data Pipeline significantly reduces preparation time and cost while maintaining high data quality. Importantly, our pipeline enables the creation of purpose-driven datasets tailored to specific domains and languages, enhancing the applicability of LLMs in specialized contexts. We anticipate that our pipeline will lower the barriers to LLM development, enabling a wide range of organizations to access LLMs more easily.

data pipeline, large language model, natural language, (4 more...)

arXiv.org Artificial Intelligence

2411.11289

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks

Shi, Ziji, Li, Jialin, You, Yang

arXiv.org Artificial IntelligenceNov-6-2024

Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, efficiently training GANs remains a critical challenge due to their computationally intensive and numerically unstable nature. Existing methods often require days or even weeks for training, posing significant resource and time constraints. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, resulting in over 30% improvements in throughput. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.

artificial intelligence, machine learning, paragan, (17 more...)

arXiv.org Artificial Intelligence

2411.03999

Country:

North America > United States > Washington > King County > Redmond (0.06)
Asia > Singapore > Central Region > Singapore (0.04)
North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.64)

Industry:

Information Technology (0.46)
Media (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback